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Abstract 

Background: Staphylococcus aureus is both human commensal and an important human pathogen, responsible for 
community-acquired and nosocomial infections ranging from superficial wound infections to invasive infections, such 
as osteomyelitis, bacteremia and endocarditis, pneumonia or toxin shock syndrome with a mortality rate up to 40%. 5. 
aureus reveals a high genetic polymorphism and detecting the genotypes is extremely useful to manage and prevent 
possible outbreaks and to understand the route of infection. One of current and expanded typing method is based on 
the X region of the spa gene composed of a succession of repeats of 21 to 27 bp. More than 10000 types are known. 
Extracting the repeats is impossible by hand and needs a dedicated software. Unfortunately the only software on the 
market is a commercial program from Ridom. 

Findings: This article presents DNAGear, a free and open source software with a user friendly interface written all in 
Java on top of NetBeans Platform to perform spa typing, detecting new repeats and new spa types and synchronizing 
automatically the files with the open access database. The installation is easy and the application is platform 
independent. In fact, the SPA identification is a formal regular expression matching problem and the results are 1 00% 
exact. As the program is using Java embedded modules written over string manipulation of well established 
algorithms, the exactitude of the solution is perfectly established. 

Conclusions: DNAGear is able to identify the types of the 5. aureus sequences and detect both new types and 
repeats. Comparing to manual processing, which is time consuming and error prone, this application saves a lot of 
time and effort and gives very reliable results. Additionally, the users do not need to prepare the forward-reverse 
sequences manually, or even by using additional tools. They can simply create them in DNAGear and perform the 
typing task. In short, researchers who do not have commercial software will benefit a lot from this application. 



Findings 

Background 

Since recent decades, the epidemiological situation 
has changed: high prevalence of hospital-acquired 
methicillin-resistant Staphylococcus aureus (MRSA) 
in some countries and hospital units, outbreaks of 
community-acquired MRSA unrelated to the hospital, 
emergence of S. aureus from pigs transmitted to humans 
with colonization and some cases of human infection, 
emergence of MRSA with intermediate susceptibility to 
glycopeptides (GISA and heteroVISA) and description 
of strains resistant to vancomycin (VRSA) or linezolid, 
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and the emergence of virulent strains responsible for 
shock toxinic, necrotizing pneumonia with a high mor- 
tality. Molecular typing is essential to detect the genetic 
polymorphism at microscale for molecular epidemiol- 
ogy study, especially in outbreak investigations, but also 
at a macroscale for phylogenetic and population-based 
analyses. Numerous molecular typing methods have 
been developed for S. aureus, including Pulsed-Field Gel 
Electrophoresis (PFGE), MultiLocus Sequence Typing 
(MLST), but also typing based on sequence polymor- 
phisms of the following loci: the Staphylococcal Cassette 
Chromosome mecA (SCCrnec), the accessory gene reg- 
ulator (agr) and the X region encoding protein A (spa 
gene). In the last ten years, the spa-typing based on the 
X region of the protein A gene has become a standard 
method for S. aureus, see examples in [1-4]. The X region 
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of spa gene is composed of variable numbers of repeats of 
21 to 27 base pairs (24-bp repeat being the most common 
one). The polymorphism of this region is based on the 
deletions or duplications of these repeats and punctual 
mutations. Each unique sequence of repeats is denned by 
a value, and a unique combination of values identifies a 
spa-type. Numerous papers showed the usefulness of this 
region to characterize both micro- and macro-variation 
in S. aureus [1-8]. Nowadays more than 500 repeats and 
more than 10000 spa-types were described according to 
the sequence and organization of repeats, considering 
repeat polymorphism, association of repeats and number 
of repeats. To permit a global approach and to define a 
uniform code terminology of spa-vepe&ts and -types, [9] 
developed the software Ridom StaphType that synchro- 
nizes either directly via the http-protocol or file based 
(e.g. via e-mail) with an accompanying SpaServer that 
functions as operative source for all new spa-repeat and 
-types codes. In this context, our objective was to provide 
a free and open source software synchronized with Ridom 
StaphType public database to determine from the raw spa 
sequence the repeats and types according to those already 
described in SpaServer and to identify the new ones. The 
X region of each S. aureus isolate was amplified by PCR 
as described by [6]. 

Implementation 

Spa sequence typing involves three main operations: Iden- 
tifying sequence type, detecting new types and discover- 
ing new repeats. DNAGear is an application that is written 
in Java and built on top of NetBeans Platform which 
performs all of these operations. It offers both, an easy 
environment for users to work, and a modular underlying 
for developers for further improvements. In addition to 
these operations, the sequence forward-reverse alignment 
task is supported by the application. 

File Format 

DNAGear is a project-based application, Figure 1, in 
which the user can create multiple projects. Each project 
has input and output text files. Input files are: repeats, 
types, and sequences files. Each file has a set of entries. 
Each entry has name and value. The repeat 's value entry 
is a short DNA sequence, while the types one is a set of 
repeats' numbers (part of the repeat name). Examples of 
these entries are shown in Figure 2, [see Additional file 1 
also]. The output files are the processed sequences. Each 
entry in the output file has the original sequence entry 
followed by its type and the new repeats if exist in the 
sequence, [see Additional file 2]. 

Files Processing 

Sequence Typing and New Types and Repeats Detec- 
tion: To find a sequence type, for each sequence in the 



sequences' file, after sorting the repeats according to their 
values' lengths, do the following: 

1. For each repeat in the repeats' file, find and replace 
each repeat's value by the repeats number. 

2. Use the regular expression "[-\d\d]+" to collect these 
numbers from the string resulted from step 1. The 
result is a hexadecimal-like string of numbers, see 
type value in Figure 2. 

3. Look up the found result from step 2 in the types' file 
to find the sequence type. If no type is found, then 
this type is marked as new type in the result file. 

Please note that, the "\d" in the regular expression 
"[-\d\d]+" can be rewritten as "[0-9]". The "+" sign at 
the end of this expression means that we are looking for 
1 or more repetition of two digits prefixed by a dash 
("[-\d\d]"). 

To detect new repeats, the following regular expressions 
are used [6]: 

• "AAAGAAGA [ ACGT-] {4} AA [ ACGT-] { 1 } 
AA[ACGT-]{l}[ACGT-]{l,4}CC[ACGT-]{4}" 

• "GAGGAAGA[ACGT-]{4}AA[ACGT-]{1} 
AA[ACGT-]{l}[ACGT-]{l,4}CC[ACGT-]{4}" 

Any sub-sequence in the resulted sequence, from step 
1 above, that complies with these regular expressions is 
marked as new repeat in the result file. 

For more algorithmic convenience and better under- 
standing, let us use lists instead of files. In this con- 
text, let S = {siY^ 1 , R = {r^lf , and T = {t k } k k =\ be the 
sequences, repeats, and types lists, respectively. In addi- 
tion, let us define rj(x) and v(x) as the functions of the 
name and value of x, respectively, where x e {s, r, t}. Our 
objective is to assign a type for each s and to detect the 
new repeats R* and types T* in S. 

Before explaining the proposed method, let us denote 
by t(s) to the type of a given sequence s. The algo- 
rithm of performing the spa typing tasks is shown in 
Algorithm 1. In this algorithm, the function sort (R) sorts 
the repeats' list R according to their values' lengths, so 
the repeat of the longest value is tested in the begin- 
ning to assure that a complete repeat value is replaced by 
the Replace function. The Replace(v(r), vis), rj(r)) func- 
tion searches and replaces the repeat's value v(r) in the 
sequence's value v(s) by the repeat's name (only the num- 
ber part of the name is used, see Figure 2) rj (r). The back- 
bone of the algorithm is the regular expression compiler 
(performed by Java built-in well-established classes), it 
is denoted here by the function MatchRegEx(regexp,str). 
This function finds any substring in the string str 
that fits the regular expression regexp. The func- 
tion zAc\{element,list) adds element (if is not existing) 
to list. 



AL-Tam etal. BMC Research Notes 201 2, 5:642 
http://www.biomedcentral.eom/1756-0500/5/642 



Page 3 of 5 



do 



Algorithm L SEQUENCEPROCESSING(S, R, T) 

R* <r- NULL 
T* <- NULL 
sort(R) 

for each sgS 



comment : Sequence Type Detection 
foreach r e R 

do {Replace (v(r), v(s), ^(r)) 
m^fc/z MatchRegEx("[-\d\d]+", v(s)) 
newtypeFlag true 
foreach t e T 

if match = v(£) 

' T(5) <- 1/(0 

then « newtypeFlag false 
break 

comment New Types Detection 
if newtypeFlag = true 

add (match, I*) 
t(s) match 
comment : New Repeats Detection 
newrMatch\ MatchRegEx( 
"AAAGAAGA [ACGT-] {4} 
AA[ACGT-] {l}AA[ACGr-] 
{l}[ACGr-] {l,4}CC[ACGr-] {4}>(s)) 
newrMatch^ <- MatchRegEx( 
U GAGGAAGA[ACGT—] {4} 
AA[ACGT—] {1}AA[ACGT—] 
{1}[ACGT-} {1,4}CC[ACGT-] {4}>(s)) 
ifwewrM^fc/zi y^NULL 
then add(newrM^^c/zi, 7?*) 
else if newrMatch^i ^ NULL 
then add(newrM<2£c/z2> i?*) 
return^, 7?*, T*) 



do 



then 



Forward-reverse Sequence Alignment: This is an addi- 
tional feature (see [10] for more details), which helps the 
user to align the sequences automatically. The input of 
this task composes two sequence files. The first one is 
called forward and the other is called reverse. Initially, 
the reverse sequences is complemented. To find the com- 
plementary of a sequence, flip the letters of the input 
sequences as the following: 



A< 
C< 



T 
G 



then reverse it so that, the left most letter becomes the 
right most one and vise-versa. Finally, a maximum match 
between the forward sequence and the reversed com- 
plementary sequence is searched to find the sequences' 
junction region (middle overlap) and produce one big- 
ger sequence composed of the union of the comple- 
mentary and forward sequences. The overlap region is 
found, by repeatedly, removing one letter from the for- 
ward sequence and check if the reverse complementary 
sequence starts with the remaining part of the forward 
sequence, until a match is found or the whole forward 
sequence has no more letters. 

DNAGear Architecture 

DNAGear is composed of three main modules, each of 
them contains a set of Java classes. These modules are: 

• DNAProject: This module manages the projects. 
Multiple projects can be created and processed by the 
same application instance, each project contains 
three main folders: sequences, basefiles, and results. 
The sequences folder contains the sequence files to 
be processed. The basefiles folder contains the 
repeats and types files. Once the sequences are 
processed, each sequence will have a result file stored 
in the results folder. 

• DNA Workspace: This module is responsible for 
managing the Graphical User Interface (GUI) 
components. These components include, the main 
toolbar, the menus and actions. 

• SequenceProcessing: This is the core module, it 
implements the typing of the sequences and the 
detection of new types and repeats tasks. It is 
composed of different classes, each one handles a 
certain entity. For example, repeat class performs 
repeats loading operation from the repeats file and 
manages the creation of repeat entries' list, to be then 
used during the typing process. 

Results and Discussion 

DNAGear regular expression and string manipulation is 
based on well-developed and optimized classes offered 
by Java. Nevertheless, it was tested on more than 50 S. 
aureus sequences for sequence typing and on more than 
300 sequences for forward-reverse alignment task. The 
results were verified manually. In terms of accuracy, they 
showed a definite accuracy of finding sequences types, 
and detecting new types and repeats. Additionally, The 
biologists who tested it, reported no problem in the soft- 
ware so far. Although, the only existing and commercial 
software (Ridom StaphType) has more graphical features, 
support database management and offers some other 
meta-statistical reports, it is closed source and platform 
dependent. Furthermore, its core sequence typing tasks 
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File Edit View Navigate Run Tools Window Help 

Q d T+ s + V % "0 C 1 0> 



Projects x| E 
* 3 DNASequence 
V t3 basefiles 

B sparepeats.fasta 

LJ spatypes.txt 
v ^ results 

0 Alignement spatyping2.seqrst 

Li alignement spatyping.seqrst 

l_j alignement spatyping3.seqrst 

[j seq consensus spa-l.seqrst 

[j sequences consensus spa manquantes.seqrst 
^ G sequences 



lent spatyping2.fas 



Q alignement spatyping.fas 

Lj alignement spatyping3.txt 

D seq consensus spa- 1 fas 

□ sequences consensus spa manquantes.fas 



| |U Alignement spatyping 2,fas x | 

^SlConsensus 

TAAGACGATCCTTCGGTGAGCAAAGAAATTTTAGCAGAAGCTAAAAAGCTAAACGATGCT 

CAAGCACCAAAAGAGGAAGACAACAACAAACCTGGCAAAGAAGACAACAAGCCTGGTAAA 

GAAGACAACAACAAACCTGGTAAAGAAGACAACAACAAGCCTGGTAAAGAAGACGGCAAC 

AAGCCTGGTAAAGAAGACGGCAACAAGCCTGGTAAAGAAGACGGCAACAAACCTGGCAAA 

GAAGATGGCAACAAGCCTAGTAAAGAAGACGGCAACAAGCCTGGTAAAGAAGACGGCAAC 

GGAGTACATGTCGTTAAACCTGGTGATACAGTAAATGACATTGCAAAAGCA 

>S9Consensus 

TAAGACGATCCTTCGGTGAGCAAAGAAATTTTAGCAGAAGCTAAAAAGTTAAACGATGCT 
CAAGCACCAAAAGAGGAAGACAACAACAAGCCTGGTAAAGAAGACGGCAACAAACCTGGT 
AAAGAAGACAACAAAAAACCTGGCAAAGAAGATGGCAACAAACCTGGTAAAGAAGACAAC 
AAAAAACCTGGCAAAGAAGATGGCAACAAACCTGGTAAAGAAGACAACAAAAAACCTGGT 
AAAGAAGATGGCAACAAACCTGGTAAAGAAGACGGCAACGGAATACATGTCGTTAAACCT 
I GGTGATACAGTAAATGACATTGCAAAAGCA 
>S17Consensus 

TAAGACGATCCTTCGGTGAGCAAAGAAATTTTAGCAGAAGCTAAAAAGTTAAACGATGCT 

CAAGCACCAAAAGAGGAAGACAACAACAAGCCTGGTAAAGAAGACGGCAACAAACCTGGT 

AAAGAAGACAACAAAAAACCTGGCAAAGAAGATGGCAACAAACCTGGTAAAGAAGACAAC 

AAAAAACCTGGCAAAGAAGATGGCAACAAACCTGGTAAAGAAGACAACAAAAAACCTGGT 

AAAGAAGATGGCAACAAACCTGGTAAAGAAGACGGCAACGGAATACATGTCGTTAAACCT 

GGTGATACAGTAAATGACATTGCAAAAGCA 

>S25Consensus 

TAAGACGATCCTTCGGTGAGCAAAGAAATTTTAGCAGAAGCTAAAAAGCTAAACGATGCT 
CAAGCACCAAAAGAGGAAGACAACAACAAACCTGGCAAAGAAGACAACAAGCCTGGTAAA 
GAAGACAACAACAAACCTGGTAAAGAAGACAACAACAAGCCTGGTAAAGAAGACGGCAAC 
AAGCCTGGTAAAGAAGACGGCAACAAGCCTGGTAAAGAAGACGGCAACAAACCTGGCAAA 
GAAGATGGCAACAAGCCTAGTAAAGAAGACGGCAACAAGCCTGGTAAAGAAGACGGCAAC 
GGAGTACATGTCGTTAAACCTGGTGATACAGTAAATGACATTGCAAAAGCA 



a 



Figure 1 DNAGear graphical user interface. DNAGear main window. Project panel in the left, the menu and tool bar are at the top, and the 
working space (where the user can view and edit the files) is in the right. 



are similar to DNAGear. On the other hand, DNAGear has 
the following positive points: 

• it is free and open source 

• supports automatic alignment of the forward-reverse 
sequences. 

• it is platform independent and is shaped with 
different installers suitable for different target 
platforms (Linux, Windows, and MacOS). 

• it performs the three major aspects of typing process. 



• files are managed in project folders, which can be 
added, edited, moved, renamed or deleted easily. For 
example, in addition to adding a sequence file by 
using the designed tool bar button, the user can add it 
by dragging and dropping the file from its location on 
the disk to the sequences folder in the project tree. 

It can also be extended very easily because it is built on a 
set of modules, that are managed by a very good Java mod- 
ules development environment (NetBeans Platform 7.1), 



(a) (b) 



equence Nnumber 



/Delimiter .Repeat Nnumber /Type Number Repeat Number Seperator 



>r number 

_ . _ _ ^t- « a r^r^r^ at {number XX -XX-XX-XX-XX-XX-XX-XX*XX-XX 

CaACaCaU I AblAObbAT ^r' 



Repeat Value Delimiter Type Value 

(c) 

D< 

/ 

>S number Consensus 

AGACGATCCTTCGGTGAGCAAAGAAATTTTAGCAGAAGCTAAAAAGCTAAACGATGCTCAAGCACCAAAAGAGG 

AAGACAACAACAAGCCTGGTAAAGAAGACGGCAACAAACCTGGTAAAGAAGACAACAAAAAACCTGGCAAAGA 

AGATGGCAACAAACCTGGTAAAGAAGACAACAAAAAACCTGGCAAAGAAGATGGCAACAAACCTGGTAAAGAAG 

ACAACAAAAAACCTGGTAAAGAAGATGGCAACAAACCTGGTAAAGAAGACGGCAACGGAATACATGTCGTTAAAC 

CTGGTGATACAGTAAATGACTTGCAAAAGC 

Sequence Value 

Figure 2 Input file's formats, a- repeat entry format, b- type entry format and c- sequence entry format. The (XX) are two numerical decimal digits 
representing a repeat number. 
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which allows developers to add new features as needed 
without deep understanding of the code already designed. 

Conclusions 

A spa typing application has been developed to be freely 
available to the academic purposes. It finds sequence 
types, new types, and new repeats in the Staphylococcus 
aureus sequences. Furthermore, it has a rich and easy- 
to-use GUI environment which allows users to deal with 
as many as they want of files managed in projects. It 
also supports the forward-reverse sequence alignment for 
sequence assembling task. For portability and installa- 
tion easiness, the system is built in Java and on top of 
the NetBeans Platform; therefore, it can run on different 
machines with different platforms. 

Availability and requirements 

• Project name: DNAGear 

• Project home page: 
http://w3.ualg.pt/~hshah/DNAGear/ 

• Operating system(s): Platform independent 

• Programming language: Java and the GUI is designed 
on top of NetBeans Platform 7.1 

• Other requirements: JRE 1.6 or higher 

• License: None 

• Any restrictions to use by non-academics: None 



Additionally, DNAGear has different installers that sup- 
port the target platform. 

Additional files 



Additional file 1 : A sample sequence file. A sample sequence file in 
plain text (.txt) for testing the application. The entries in the file have the 
same sequence formating described earlier. 

Additional file 2: A sample result file. A sample result file in plain text 
(.txt) obtained from processing the sample sequence file (Additional file 1 ) 
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